Import the datasets and libraries, check datatype, statistical summary, shape, null values or incorrect imputation. (5 marks)
EDA: Study the data distribution in each attribute and target variable, share your findings (20 marks)
Number of unique values in each column?
Number of people with zero mortgage?
Number of people with zero credit card spending per month?
Value counts of all categorical columns.
Univariate and bivariate analysis
Get data model ready
Split the data into training and test set in the ratio of 70:30 respectively (5 marks)
Use the Logistic Regression model to predict whether the customer will take a personal loan or not.
Find out
Give a conclusion related to the business understanding of your model. (5 marks)
#Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#pd.options.display.float_format = '{:,.2f}'.format
# List .csv file
!ls *.csv
# Import dataset
df = pd.read_csv('Bank_Personal_Loan_Modelling.csv')
# check data type
df.info()
# Shape
df.shape
# Null values
df.isna().sum()
Insights:
# Statistical Summary
df.describe()
Insights:
from pandas_profiling import ProfileReport  # in newer versions: from ydata_profiling import ProfileReport
prof = ProfileReport(df)
prof
Insights:
# Count negative values in each column
(df < 0).sum()
Insights:
df[ df['Experience'] < 0 ].describe()
Insights:
sns.countplot(x='Personal Loan', data=df[df['Age'] < 30], hue='Experience')
Number of unique values in each column?
Number of people with zero mortgage?
Number of people with zero credit card spending per month?
Value counts of all categorical columns.
Univariate and bivariate analysis
Get data model ready
df.nunique()
Insights:
df[ df['Mortgage'] == 0 ]['ID'].count()
# As a percentage
df[ df['Mortgage'] == 0 ]['ID'].count()/df.shape[0] * 100
df[ df['CCAvg'] == 0 ]['ID'].count()
# As a percentage
df[ df['CCAvg'] == 0 ]['ID'].count()/df.shape[0] * 100
categorical_columns = ['Family','Education','Securities Account','CD Account','Online','CreditCard']
for col in categorical_columns:
    print("** Column = ", col)
    print(df[col].value_counts(normalize=True))
    print()
df.info()
# Distribution plots of continuous independent variables
for col in ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']:
    sns.histplot(df[col], kde=True)  # distplot is deprecated in newer seaborn
    plt.show()
Insights:
# Count plot of categorical independent variables
for col in categorical_columns:
    sns.countplot(x=col, data=df)
    plt.show()
sns.countplot(x='Personal Loan', data=df)
Insights:
sns.heatmap(df.corr())
Insights:
sns.boxplot(x='Personal Loan', y='Age', data=df)
plt.show()
sns.boxplot(x='Personal Loan', y='CCAvg', data=df)
plt.show()
sns.boxplot(x='Personal Loan', y='Income', data=df)
plt.show()
sns.boxplot(x='Personal Loan', y='Mortgage', data=df)
plt.show()
Insights:
for col in categorical_columns:
    sns.countplot(x=col, data=df, hue='Personal Loan')
    plt.show()
Insights:
df.shape
num_records_with_negative_years_experience = df[ df['Experience'] < 0 ]['Personal Loan'].count()
num_records_with_negative_years_experience
# Find the median of all records with non-negative Experience
experience_median = df[ df['Experience'] >= 0 ]['Experience'].median()
experience_median
# Replace negative experience values with experience_median
df['Experience'] = df['Experience'].apply(lambda x: x if x >= 0 else experience_median)  # >= 0 keeps legitimate zero-experience records
df.describe()['Experience']
df_cleaned = df.copy()  # copy so later changes don't mutate df
df_cleaned.describe()
df_cleaned['Personal Loan'].value_counts(normalize=True)
categorical_columns_one_hot = ['Education']
df_cat=pd.get_dummies(df_cleaned,columns=categorical_columns_one_hot)
df_cat.head()
df_cat['ZIP Code'].nunique()
zipcode_to_personal_loans = df_cat.groupby('ZIP Code').sum()['Personal Loan']
# Preview: loan count mapped onto each record's ZIP Code
zipcode_to_personal_loans[df_cat['ZIP Code']]
# Create a dictionary that maps ZIP Code to number of Personal Loans
map_dict = dict(zipcode_to_personal_loans)
# Create new column Loans-By-Zipcode
df_cat['Loans-By-Zipcode'] = df_cat['ZIP Code'].apply(lambda x: map_dict[x])
# Quick visualization of this newly created column
sns.countplot(x='Loans-By-Zipcode', data=df_cat)
Insights:
df_cat.head()
y = df_cat['Personal Loan']
# Drop ID field, that is just a unique identifier for every record
# Drop ZIP Code field, now that we have extracted the information we need
# Drop Personal Loan (y dependent variable)
df_cat.drop(['ID','ZIP Code','Personal Loan'],inplace=True,axis='columns')
df_cat.head()
X = df_cat
This completes the data model preparation.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
y_train.value_counts(normalize=True)
y_test.value_counts(normalize=True)
Insights:
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_auc_score,accuracy_score
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=5463)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
## Function to render the confusion matrix in a readable format
def draw_cm(actual, predicted):
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(15, 10))
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=[0, 1], yticklabels=[0, 1])  # fmt='d' for integer counts
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
print("Training accuracy",model.score(X_train,y_train))
print()
print("Testing accuracy",model.score(X_test, y_test))
print()
print('Confusion Matrix')
draw_cm(y_test,y_pred)
print()
print("Recall:",recall_score(y_test,y_pred))
print()
print("Precision:",precision_score(y_test,y_pred))
print()
print("F1 Score:",f1_score(y_test,y_pred))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_pred))
# Null accuracy , choose 1 of the classes all the time
# Useful metric when we have an imbalance in class distribution
def compute_null_accuracy(y):
    distrib = y.value_counts(normalize=True)
    return distrib.max() * 100  # share of the majority class, as a percentage
print( "Null accuracy = {}% ".format(compute_null_accuracy(y_test)))
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # AUC needs probabilities, not hard labels
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:,1])
plt.figure(figsize=(15, 10))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.grid(True)
plt.savefig('Log_ROC')  # grid before savefig so it appears in the saved figure
plt.show()
Insights:
accuracy_plot=[]
recall_plot=[]
precision_plot=[]
x_range = np.linspace(0.1,0.9,100)
for threshold in x_range:
    y_pred = np.where(model.predict_proba(X_test)[:, 1] > threshold, 1, 0)
    accuracy_plot.append(accuracy_score(y_test, y_pred))
    recall_plot.append(recall_score(y_test, y_pred))
    precision_plot.append(precision_score(y_test, y_pred))
plt.figure(figsize=(15, 10))
plt.plot(x_range, accuracy_plot,label="accuracy")
plt.plot(x_range, recall_plot,label='recall')
plt.plot(x_range, precision_plot,label='precision')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()
Threshold = 0.3
y_pred = np.where(model.predict_proba(X_test)[:,1] > Threshold ,1, 0)
print("Testing accuracy",accuracy_score(y_test, y_pred))
print()
print('Confusion Matrix')
draw_cm(y_test,y_pred)
print()
print("Recall:",recall_score(y_test,y_pred))
print()
print("Precision:",precision_score(y_test,y_pred))
print()
print("F1 Score:",f1_score(y_test,y_pred))
print()
print("Roc Auc Score:",roc_auc_score(y_test,y_pred))
## Feature Importance or Coefficients
fi = pd.DataFrame()
fi['Col'] = X_train.columns
fi['Coeff'] = np.round(abs(model.coef_[0]),2)
fi.sort_values(by='Coeff',ascending=False)
# Observe the predicted and observed classes in a dataframe.
z = X_test.copy()
z['Actual-Output'] = y_test
z['Predicted-Output'] = y_pred
filter_actual_not_equal_predicted = ( z['Actual-Output'] != z['Predicted-Output'])
z[filter_actual_not_equal_predicted]
Confusion matrix meaning
True Positive (observed=1, predicted=1): Predicted that the customer will accept the Personal Loan, and the customer did accept.
False Positive (observed=0, predicted=1): Predicted that the customer will accept the Personal Loan, but the customer did not accept.
True Negative (observed=0, predicted=0): Predicted that the customer will not accept the Personal Loan, and the customer did not accept.
False Negative (observed=1, predicted=0): Predicted that the customer will not accept the Personal Loan, but the customer did accept.
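The four cells above can be counted directly from paired actual/predicted labels. A minimal sketch, using hypothetical labels (not the notebook's data) so the counts are easy to verify by eye:

```python
# Hypothetical actual and predicted labels for 8 customers (illustrative only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat  = [1, 0, 0, 1, 1, 0, 1, 0]

# Count each confusion-matrix cell by comparing actual vs predicted
tp = sum(1 for a, p in zip(y_true, y_hat) if a == 1 and p == 1)  # predicted accept, did accept
tn = sum(1 for a, p in zip(y_true, y_hat) if a == 0 and p == 0)  # predicted decline, did decline
fp = sum(1 for a, p in zip(y_true, y_hat) if a == 0 and p == 1)  # predicted accept, did decline
fn = sum(1 for a, p in zip(y_true, y_hat) if a == 1 and p == 0)  # predicted decline, did accept

print(tp, tn, fp, fn)  # 3 3 1 1
```

Note that sklearn's `confusion_matrix(y_true, y_hat).ravel()` returns these same counts in the order tn, fp, fn, tp.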
Important Features
"CD Account", "Education", "Securities Account", "Online", "Family", "CreditCard" appear to be the top 6 features influencing the model's output, based on their coefficient values.
Important Metric
False Negatives represent lost business opportunities, i.e. customers who would have taken the Personal Loan had they been offered it.
False Positives represent wasted marketing dollars: we offered the Personal Loan but the customer did not take it.
I would bias the model towards a lower number of False Negatives, because we can assume the bank makes far more profit from a customer accepting a Personal Loan than it spends on the incremental marketing cost. Lowering False Negatives means Recall is the important metric: we want to increase Recall without hurting accuracy.
Moving the threshold from 0.5 to 0.3 accomplishes that while maintaining the model's accuracy at ~95%.
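The mechanism behind the threshold move can be sketched in isolation. A minimal example with hypothetical predicted probabilities (not this model's output): positives with moderate scores fall between 0.3 and 0.5, so lowering the threshold converts those False Negatives into True Positives and raises Recall.

```python
# Hypothetical predicted probabilities for 8 customers (illustrative only)
probs  = [0.9, 0.6, 0.45, 0.35, 0.2, 0.55, 0.4, 0.1]
actual = [1,   1,   1,    1,    0,   0,    0,   0]

def recall_at(threshold):
    # Classify as 1 when the predicted probability exceeds the threshold
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for a, q in zip(actual, preds) if a == 1 and q == 1)
    fn = sum(1 for a, q in zip(actual, preds) if a == 1 and q == 0)
    return tp / (tp + fn)

print(recall_at(0.5))  # 0.5 -> the two positives scored 0.45 and 0.35 are missed
print(recall_at(0.3))  # 1.0 -> all four positives are caught
```

The trade-off, visible in the precision/recall plot above, is that lowering the threshold also flips some negatives to predicted positives, which costs precision; 0.3 is chosen because the recall gain comes without a meaningful accuracy drop.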